nsexec: cloned binary rework #3987

cyphar · 2023-08-18T14:56:28Z

Split out from #3953 to be reviewed separately.

First, Migrate all of the memfd /proc/self/exe logic to Go code.
This allow us to remove the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.

Then, carry the code from #3983 and combine it with the above to
keep all of the runc-dmz logic in the Go code as well. The runc-dmz
feature is optional (controlled by the runc_nodmz build tag at
build time, or RUNC_DMZ=legacy at runtime), with a fallback to the
classic /proc/self/exe cloning. If /proc/self/exe is already a
cloned binary then neither runc-dmz nor /proc/self/exe are
cloned.

This also adds a memfd-bind helper binary for users which really
need to avoid any excess copies. Unfortunately, the semantics of
bind-mounting a memfd are quite ugly (they cannot be bind-mounted
directly -- instead, you must bind-mount the magic-link of a file
descriptor pointing to the memfd, which means you need a long-running
process to keep the magic-link alive).

Fixes #3965
Fixes #3973
Signed-off-by: Aleksa Sarai [email protected]

README.md

libcontainer/container_linux.go

AkihiroSuda · 2023-08-20T14:14:24Z

libcontainer/dmz/cloned_binary_linux.go

+	}
+	path := "/proc/self/fd/" + strconv.Itoa(int(f.Fd()))
+	if err := unix.Access(path, unix.X_OK); err == nil {
+		return true


The binary might be still unexecutable depending on LSM?
Maybe we should just execute the binary and see whether it returns a magic ret code?

We could, though doing a maximum of ~10 extra fork+execs in the absolute worst case seems a bit expensive... This also can't detect every LSM issue (it's possible executing a host /tmp program will be blocked by the container's LSM setup -- and we cannot detect that this early into container setup).

I think the only true solution for a really serious LSM setup is to disable runc-dmz entirely, or we'll have to create a tmpfs mountfd that allows executables, which would add more mount_lock lock contention again....

I think the new RUNC_DMZ=legacy option means that users can work around this if they hit an issue here. I don't think there's a nicer way of dealing with this problem completely tbh.

libcontainer/dmz/dmz_linux.go

libcontainer/dmz/runc-dmz.c

libcontainer/dmz/cloned_binary_linux.go

Makefile

switched to LANG=C.UTF-8

libcontainer/dmz/cloned_binary_linux.go

libcontainer/system/linux.go

libcontainer/dmz/cloned_binary_linux.go

.github/workflows/test.yml

Makefile

libcontainer/dmz/dmz_linux.go

libcontainer/dmz/dmz.go

libcontainer/dmz/dmz_linux.go

libcontainer/dmz/Makefile

libcontainer/dmz/dmz_linux.go

libcontainer/system/linux.go

.github/workflows/test.yml

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lifubang <[email protected]>

This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before.

This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lifubang [email protected]

This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lifubang [email protected] Signed-off-by: lfbzhm <[email protected]>

This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lfbzhm <[email protected]>

In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. But because of we have added safeExe to the set of ExtraFiles, if the fd of safeExe is too small, go stdlib will dup3 it to another fd, then it will cause the original fd closed. (opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>

In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. There is a race situation when we are opening this memfd, if the fd 5 or 6 was closed at that time, maybe it will be reused by memfd. But because of we have added safeExe to the set of ExtraFiles, if the fd of safeExe is than stdio fds count + ExtraFiles count, go stdlib will dup3 it to another fd, then it will cause the original fd closed. (opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>

In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. There is a race situation when we are opening this memfd, if the fd 6 or 7 was closed at that time, maybe it will be reused by memfd. But because of we have added safeExe to the set of ExtraFiles, if the fd of safeExe is not bigger than stdio fds count + ExtraFiles count, go stdlib will dup3 it to another fd, then it will cause the original fd closed. (opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>

In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. There is a race situation when we are opening this memfd, if the fd 6 or 7 was closed at that time, maybe it will be reused by memfd. Because we want to add safeExe to the set of ExtraFiles, if the fd of safeExe is too small, go stdlib will dup3 it to another fd, or dup3 a other fd to this fd, then it will cause the fd type cmd.Path refers to a random path. (issue: opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>

This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lfbzhm <[email protected]>

In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. There is a race situation when we are opening this memfd, if the fd 6 or 7 was closed at that time, maybe it will be reused by memfd. Because we want to add safeExe to the set of ExtraFiles, if the fd of safeExe is too small, go stdlib will dup3 it to another fd, or dup3 a other fd to this fd, then it will cause the fd type cmd.Path refers to a random path. (issue: opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>

This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lfbzhm <[email protected]>

cyphar added this to the 1.2.0 milestone Aug 18, 2023

cyphar marked this pull request as ready for review August 20, 2023 13:58

This was referenced Aug 20, 2023

[Proposal] Use runc-dmz to defeat CVE-2019-5736 #3983

Closed

nsexec: spring cleaning #3953

Closed

AkihiroSuda reviewed Aug 20, 2023

View reviewed changes

README.md Outdated Show resolved Hide resolved

AkihiroSuda reviewed Aug 20, 2023

View reviewed changes

libcontainer/container_linux.go Show resolved Hide resolved

AkihiroSuda reviewed Aug 20, 2023

View reviewed changes

libcontainer/dmz/dmz_linux.go Outdated Show resolved Hide resolved

AkihiroSuda reviewed Aug 20, 2023

View reviewed changes

libcontainer/dmz/runc-dmz.c Outdated Show resolved Hide resolved

eiffel-fl reviewed Aug 21, 2023

View reviewed changes

libcontainer/dmz/cloned_binary_linux.go Outdated Show resolved Hide resolved

cyphar requested a review from AkihiroSuda August 21, 2023 16:08

lifubang added the status/2-code-review label Aug 22, 2023

lifubang previously requested changes Aug 22, 2023

View reviewed changes

Makefile Outdated Show resolved Hide resolved

Makefile Show resolved Hide resolved

cyphar requested a review from lifubang August 23, 2023 07:10